Appendix C — Assignment C

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

  3. Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Sunday, 7th May 2023 at 11:59 pm.

  5. Five points are for properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (2 pts). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file. If your issue doesn’t seem genuine, you will lose points.
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
  • Final answers of each question are written in Markdown cells (1 pt).
  • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

C.1 Regression Problem - Miami housing

C.1.1 Data preparation

Read the data miami-housing.csv. Check the description of the variables here. Split the data into 60% train and 40% test. Use random_state = 45. The response is SALE_PRC, and the rest of the columns are predictors, except PARCELNO. Print the shape of the predictors dataframe of the train data.

(2 points)
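A minimal sketch of the split described above, run on a small synthetic data frame rather than the real file. The column names other than SALE_PRC and PARCELNO, and all the values, are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (placeholder columns and values)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "PARCELNO": np.arange(100),
    "lot_area": rng.uniform(1000, 10000, 100),
    "house_age": rng.integers(0, 80, 100),
    "SALE_PRC": rng.uniform(1e5, 1e6, 100),
})

# The response is SALE_PRC; the identifier PARCELNO is not a predictor
y = data["SALE_PRC"]
X = data.drop(columns=["SALE_PRC", "PARCELNO"])

# 60% train / 40% test with the seed required by the assignment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=45
)
print(X_train.shape)
```

With the real data, the data frame would instead come from reading miami-housing.csv, and the rest of the steps would stay the same.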

C.1.2 Decision tree

Develop a decision tree model to predict SALE_PRC based on all the predictors. Use random_state = 45. Use the default hyperparameter values. What is the MAE (mean absolute error) on test data?

(3 points)

C.1.3 Tuning decision tree

Tune the hyperparameters of the decision tree model developed in the previous question, and compute the MAE on test data. You must tune the hyperparameters in the following manner:

  1. Use GridSearchCV to minimize the 5-fold mean absolute error (MAE).

  2. You must do a coarse grid search first to get an idea of the domain space where the optimal hyperparameter values lie.

  3. You must follow it up with a finer grid search to get more precise optimal hyperparameter values.

  4. You may decide yourself which hyperparameters you wish to tune. Common sense should help. There is no single correct answer.

The MAE must be less than $66,000. You must show the optimal values of the hyperparameters obtained, and the test MAE.

(3 points for coarse grid search, 3 points for finer grid search, 4 points for reaching the required MAE)
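The coarse-then-fine procedure might be sketched as follows. This is only an illustration on synthetic data: the choice of max_depth and min_samples_leaf, the grid values, and the width of the finer grid are all assumptions, not the expected answer.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the training set
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=45)
cv = KFold(n_splits=5, shuffle=True, random_state=45)

# Step 1: coarse grid over widely spaced values
coarse_grid = {"max_depth": [2, 4, 8, 16], "min_samples_leaf": [1, 5, 20]}
coarse = GridSearchCV(
    DecisionTreeRegressor(random_state=45),
    coarse_grid,
    scoring="neg_mean_absolute_error",  # GridSearchCV maximizes, so MAE is negated
    cv=cv,
)
coarse.fit(X, y)
best_depth = coarse.best_params_["max_depth"]
best_leaf = coarse.best_params_["min_samples_leaf"]

# Step 2: finer grid centered on the coarse optimum
fine_grid = {
    "max_depth": list(range(max(1, best_depth - 2), best_depth + 3)),
    "min_samples_leaf": list(range(max(1, best_leaf - 3), best_leaf + 4)),
}
fine = GridSearchCV(
    DecisionTreeRegressor(random_state=45),
    fine_grid,
    scoring="neg_mean_absolute_error",
    cv=cv,
)
fine.fit(X, y)

print("coarse:", coarse.best_params_, -coarse.best_score_)
print("fine:  ", fine.best_params_, -fine.best_score_)
```

Because the finer grid contains the coarse optimum and the folds are fixed, the finer search cannot report a worse cross-validated MAE than the coarse one.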

C.1.4 Bagging decision trees

Bag decision trees, and compute the MAE on test data. Use enough trees that the MAE stabilizes. Other than n_estimators, use default values of hyperparameters.

The test MAE must be less than $50,000.

(4 points)
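One way to check whether the MAE has stabilized is to refit the bagged model for an increasing number of trees and compare the test MAE. The sketch below uses synthetic data, and the tree counts are arbitrary illustrative values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the housing data
X, y = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=45)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=45
)

# BaggingRegressor bags decision trees by default
mae_by_trees = {}
for n_trees in [5, 25, 50, 100]:
    model = BaggingRegressor(n_estimators=n_trees, random_state=45)
    model.fit(X_train, y_train)
    mae_by_trees[n_trees] = mean_absolute_error(y_test, model.predict(X_test))

for n_trees, mae in mae_by_trees.items():
    print(n_trees, round(mae, 2))
```

When consecutive values in the printout differ only slightly, adding more trees is unlikely to change the MAE much.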

C.1.5 Bagging without bootstrapping

Bag decision trees without bootstrapping, i.e., set bootstrap = False while bagging the trees, and compute the MAE on test data. Why is the MAE obtained much higher than that in the previous question, but lower than that obtained in C.1.2?

(1 point for code, (3 + 3) points for reasoning)

C.1.6 Tuning bagged tree model

C.1.6.1 Approaches

There are two approaches for tuning a bagged tree model:

  1. Out-of-bag prediction

  2. K-fold cross-validation using GridSearchCV.

What is the advantage of each approach over the other, i.e., what is the advantage of the out-of-bag approach over K-fold cross-validation, and what is the advantage of K-fold cross-validation over the out-of-bag approach?

(3 + 3 points)
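For reference, the two error estimates might be obtained as in the sketch below. It uses synthetic data and an arbitrary number of trees, and it only shows how each estimate is computed, not which one is preferable.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=45)

# Approach 1: out-of-bag predictions from a single fit
bag = BaggingRegressor(n_estimators=100, oob_score=True, random_state=45)
bag.fit(X, y)
oob_mae = mean_absolute_error(y, bag.oob_prediction_)

# Approach 2: 5-fold cross-validation (the model is refit once per fold)
cv_scores = cross_val_score(
    BaggingRegressor(n_estimators=100, random_state=45),
    X, y,
    scoring="neg_mean_absolute_error",
    cv=5,
)
cv_mae = -cv_scores.mean()

print("OOB MAE:   ", round(oob_mae, 2))
print("5-fold MAE:", round(cv_mae, 2))
```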

C.1.6.2 Tuning the hyperparameters

Tune the hyperparameters of the bagged tree model developed in C.1.4. You may use either of the approaches mentioned in the previous question. Show the optimal values of the hyperparameters obtained. Compute the MAE on test data with the tuned model. Your MAE on test data must be less than $46,000. However, you cannot use the test data to tune the hyperparameters.

It is up to you to pick the hyperparameters and their values in the grid.

(10 points)

C.1.7 Bagging feature importance

Arrange and print the predictors in decreasing order of importance.

(4 points)
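BaggingRegressor does not expose a feature_importances_ attribute itself, so one possible approach is to average the importances of the individual trees. The sketch below assumes the default max_features = 1.0, so every tree sees all predictors in the same order. The data and the column names x1 to x4 are synthetic placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X_arr, y = make_regression(
    n_samples=300, n_features=4, n_informative=2, random_state=45
)
X = pd.DataFrame(X_arr, columns=["x1", "x2", "x3", "x4"])

bag = BaggingRegressor(n_estimators=50, random_state=45).fit(X, y)

# Average the impurity-based importances over all bagged trees
importances = np.mean(
    [tree.feature_importances_ for tree in bag.estimators_], axis=0
)

# Sort predictors from most to least important
ranked = pd.Series(importances, index=X.columns).sort_values(ascending=False)
print(ranked)
```

If max_features were set below 1.0, each tree would see only a subset of columns, and the importances would have to be mapped back through bag.estimators_features_ before averaging.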

C.1.8 Random forest

C.1.8.1 Tuning random forest

Tune a random forest model to predict SALE_PRC, and compute the MAE on test data. The MAE must be less than $46,000.

It is up to you to pick the hyperparameters and their values in the grid.

(10 points)
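As one possible illustration, a random forest can be tuned without touching the test data by comparing out-of-bag MAE across candidate values. The sketch below varies only max_features on synthetic data. The candidate values and the number of trees are arbitrary, and K-fold cross-validation with GridSearchCV would be an equally valid route.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=45)

# Out-of-bag MAE for each candidate number of predictors considered per split
oob_mae = {}
for m in [2, 4, 6, 8]:
    rf = RandomForestRegressor(
        n_estimators=200, max_features=m, oob_score=True, random_state=45
    )
    rf.fit(X, y)
    oob_mae[m] = mean_absolute_error(y, rf.oob_prediction_)

best_m = min(oob_mae, key=oob_mae.get)
print(oob_mae)
print("best max_features:", best_m)
```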

C.1.8.2 Feature importance

Arrange and print the predictors in decreasing order of importance.

(4 points)

C.1.8.3 Random forest vs bagging: max_features

Note that the max_features hyperparameter is present in both RandomForestRegressor() and BaggingRegressor(). Does it have the same meaning in both? If not, what is the difference?

Hint: Check scikit-learn documentation

(1 + 3 points)

C.2 Classification - Term deposit

The data for this question is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, where bank clients were called to subscribe to a term deposit.

There is a training dataset, train.csv, which you will use to develop a model, and a test dataset, test.csv, which you will use to test your model. Each dataset has the following attributes about the clients called in the marketing campaign:

  1. age: Age of the client

  2. education: Education level of the client

  3. day: Day of the month the call is made

  4. month: Month of the call

  5. y: did the client subscribe to a term deposit?

  6. duration: Call duration, in seconds. This attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.

(Raw data source: Source. Do not use the raw data source for this assignment. It is just for reference.)

C.2.1 Data preparation

Convert all the categorical predictors in the data to dummy variables. Note that month and education are categorical variables.

(2 points)
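A small sketch of the dummy-variable conversion using pandas. The rows and the category labels below are made up for illustration. The reindex step is an extra precaution, not a stated requirement: it keeps the test columns identical to the train columns in case a category is missing from one file.

```python
import pandas as pd

# Tiny made-up stand-ins for train.csv and test.csv
train = pd.DataFrame({
    "age": [30, 45, 52],
    "education": ["primary", "tertiary", "secondary"],
    "day": [5, 12, 20],
    "month": ["may", "jun", "may"],
    "y": ["no", "yes", "no"],
})
test = pd.DataFrame({
    "age": [41, 29],
    "education": ["secondary", "primary"],
    "day": [3, 17],
    "month": ["may", "may"],
    "y": ["no", "yes"],
})

predictors = ["age", "education", "day", "month"]
categorical = ["education", "month"]

# One dummy column per category level
X_train = pd.get_dummies(train[predictors], columns=categorical)
X_test = pd.get_dummies(test[predictors], columns=categorical)

# Give the test data exactly the same columns as the train data
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
print(X_train.columns.tolist())
```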

C.2.2 Decision tree

Develop and tune a decision tree model to predict the probability of a client subscribing to a term deposit based on age, education, day and month. The model must have:

  1. A minimum overall classification accuracy of 70%, i.e., the lower of the classification accuracies on train.csv and test.csv must be at least 70%.

  2. A minimum recall of 60%, i.e., the lower of the recalls on train.csv and test.csv must be at least 60%.

Print the accuracy and recall for both the datasets - train.csv, and test.csv.

Note that:

  1. You cannot use duration as a predictor. The predictor is not useful for prediction because its value is determined after the marketing call ends. However, after the call ends, we already know whether the client responded positively or negatively.

  2. You are free to choose any value of threshold probability for classifying observations. However, you must use the same threshold on both the datasets.

  3. Use cross-validation on train data to optimize the model hyperparameters.

  4. Using the optimal model hyperparameters obtained in (3), develop the decision tree model. Plot the 5-fold cross-validated accuracy and recall against decision threshold probability. Tune the decision threshold probability based on the plot, or the data underlying the plot, to achieve the required trade-off between recall and accuracy.

  5. Evaluate the accuracy and recall of the developed model with the tuned decision threshold probability on both the datasets. Note that the test dataset must only be used to evaluate performance metrics, and not optimize any hyperparameters or decision threshold probability.

(14 points - 4 points for tuning the hyperparameters, 4 points for making the plot, 4 points for tuning the decision threshold probability based on the plot, and 2 points for printing the accuracy & recall on both the datasets)

Hint: Restrict the search for max_depth to a maximum of 25, and max_leaf_nodes to a maximum of 50. Without this restriction, you may get a better recall for threshold probability = 0.5, but are likely to get a worse trade-off between recall and accuracy.

It is up to you to pick the hyperparameters and their values in the grid.
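The threshold-tuning step might be sketched as follows, using synthetic imbalanced data and an arbitrary max_depth in place of tuned hyperparameters. The selection rule used here (highest accuracy among thresholds whose recall is at least 60%) is one assumption about how to read the trade-off. The computed arrays can be passed to any plotting library to draw the required curves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for train.csv
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=45
)

# Cross-validated probabilities of the positive class (5 folds)
tree = DecisionTreeClassifier(max_depth=4, random_state=45)
proba = cross_val_predict(tree, X, y, cv=5, method="predict_proba")[:, 1]

# Accuracy and recall at each candidate threshold
thresholds = np.arange(0.05, 1.0, 0.05)
accuracy = []
recall = []
for t in thresholds:
    pred = (proba >= t).astype(int)
    accuracy.append(accuracy_score(y, pred))
    recall.append(recall_score(y, pred))

# Among thresholds meeting the recall target, keep the most accurate one
feasible = [i for i, r in enumerate(recall) if r >= 0.6]
best = max(feasible, key=lambda i: accuracy[i]) if feasible else None
if best is not None:
    print(round(thresholds[best], 2), accuracy[best], recall[best])
```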

C.2.3 Random forest

Develop and tune a random forest model to predict the probability of a client subscribing to a term deposit based on age, education, day and month. The model must have:

  1. A minimum overall classification accuracy of 75%, i.e., the lower of the classification accuracies on train.csv and test.csv must be at least 75%.

  2. A minimum recall of 60%, i.e., the lower of the recalls on train.csv and test.csv must be at least 60%.

Print the accuracy and recall for both the datasets - train.csv, and test.csv.

Note that:

  1. You cannot use duration as a predictor. The predictor is not useful for prediction because its value is determined after the marketing call ends. However, after the call ends, we already know whether the client responded positively or negatively.

  2. You are free to choose any value of threshold probability for classifying observations. However, you must use the same threshold on both the datasets.

  3. Use cross-validation on train data to optimize the model hyperparameters.

  4. Using the optimal model hyperparameters obtained in (3), develop the random forest model. Plot the cross-validated accuracy and recall against decision threshold probability. Tune the decision threshold probability based on the plot, or the data underlying the plot, to achieve the required trade-off between recall and accuracy.

  5. Evaluate the accuracy and recall of the developed model with the tuned decision threshold probability on both the datasets. Note that the test dataset must only be used to evaluate performance metrics, and not optimize any hyperparameters or decision threshold probability.

(12 points - 4 points for tuning the hyperparameters, 3 points for making the plot, 3 points for tuning the decision threshold probability based on the plot, and 2 points for printing the accuracy & recall on both the datasets)

Hint: Restrict the search for max_depth to a maximum of 25, and max_leaf_nodes to a maximum of 45. Without this restriction, you may get a better recall for threshold probability = 0.5, but are likely to get a worse trade-off between recall and accuracy.

It is up to you to pick the hyperparameters and their values in the grid.

C.3 Predictor transformations in trees

Can a non-linear monotonic transformation of predictors (such as log(), sqrt() etc.) be useful in improving the accuracy of decision tree models?

(3 points for answer)